• Proper adaptive routing in a folded-Clos network reduces latency and provides less variance in the distribution of packet latency, which ultimately reduces the global synchronization time in a multiprocessor.

• We show how nonuniformities in the network traffic can be created in a folded-Clos topology by the [...]



1 Introduction 

Interconnection networks are widely used to connect processors and memories in multiprocessors [20, 21], as switching fabrics for high-end routers and switches, and for connecting I/O devices. As the performance of processor and memory continues to increase in a multiprocessor computer system, the performance of the interconnection network plays a central role in determining the overall performance of the system. The latency and bandwidth of the network largely establish the remote memory access latency and bandwidth.

Recent advances in signaling technology have greatly increased the pin bandwidth available in a router chip. This bandwidth is most effectively used to reduce latency and cost by building high-radix routers with [...]



[...] and provide significantly higher throughput compared to oblivious routing.

• We compare different allocation algorithms that can be used in adaptive routing on a high-radix network and compare their performance. We introduce randomization in the allocation algorithms to simplify the routing decision with minimal loss [...]



• We evaluate the cost of implementing adaptive routing in a high-radix router. To minimize the implementation cost, we show how reduced precision simplifies the implementation and how precomputation of the allocations minimizes the impact on the router pipeline delay.

The rest of the paper is organized as follows. In Section 2, we provide background and discuss related works on topology and routing of a folded-Clos network. We compare adaptive and oblivious routing on a high-radix folded-Clos network in Section 3 and show the benefits of adaptive routing. We discuss different allocation algorithms that can be used in adaptive routing in Section 4. A cost comparison and techniques to reduce the implementation cost of adaptive routing are presented in Section 5. Section 6 presents conclusion and future works.



2 Background and Related Works 

2.1 Topology 

A Clos network is a multi-stage non-blocking network with an odd number of stages. The network is equivalent to two back-to-back butterfly networks where the last stage of the input network is fused with the first stage of the output network. The input network can route from any input to any middle-stage switch. The output network can route from any middle-stage switch to any output. A 5-stage Clos network with 8 nodes, using radix-2 routers, is shown in Figure 1(a). Because of packaging constraints, a Clos network can be folded so that the input network and output network share switch modules. A folded-Clos network is sometimes called a fat-tree [14]. The corresponding folded-Clos of the Clos network shown in Figure 1(a) is shown in Figure 1(b). The folded-Clos topology has been used in various different networks including both circuit switching and packet switching [15]. Most folded-Clos networks use low-radix routers, such as the CM-5 network which uses radix-8 routers and the Cray XD1 which uses radix-24 Mellanox [16] routers. By using high-radix routers in a folded-Clos network, the latency and the cost of the network can be lowered. The Black Widow network [20] takes advantage of high-radix routers and implements a modified high-radix folded-Clos network.

2.2 Routing 

For a given topology, an appropriate routing algorithm is needed to load-balance the traffic and minimize the latency. A routing algorithm can be classified as either oblivious, where the routing decision is made randomly, or adaptive, where the decisions are made based on the network state (e.g. queue depths along the [...]

Figure 1: [...] folded Clos or a fat-tree topology. The channels in (a) represent unidirectional channels while the channels in (b) represent bidirectional channels.



Routing a packet through a Clos network proceeds in two phases: input and output. During the input phase a middle-stage switch is selected and the packet is routed to that switch. For a folded-Clos topology, the packet need not route all the way to the middle stage but can stop as soon as a common ancestor of the source and destination nodes is reached. Any middle-stage switch (or common ancestor switch) can be selected during the input phase. The selection may be made using either oblivious or adaptive routing. During the output phase, the packet is routed from the selected middle-stage switch (or common ancestor) to the destination output port. This routing is deterministic as there exists only a single path to the destination.



[...] provides better performance than oblivious routing on the SP2 network [1]. However, their study was focused on low-radix routers and limited analysis was provided on the benefits of adaptive routing over oblivious routing.



Petrini and Vanneschi [18] evaluated a family of fat-trees and studied the benefits of virtual channels with adaptive routing but do not compare oblivious routing with adaptive routing.

We extend these works by providing a better understanding of the difference between adaptive and oblivious routing on a folded-Clos network. Because of the complexity of implementing adaptive routing in a high-radix router, we discuss different implementations and compare their performance and implementation cost.

Many studies have been done with regard to the impact of randomization. Mitzenmacher [17] studied the impact of two choices in load balancing and Azar et al. [2] studied balanced allocation and how it can be applied. Adaptive routing in a folded-Clos network is likewise an allocation problem since the packet can traverse using any of the uplinks. Therefore, the routing needs to make an assignment of the outputs to the input ports. We apply randomization to the different allocation algorithms and show that most of the gain can be realized with only two choices [2]. However, continuing to increase the number of random samples can actually degrade the performance in some allocation algorithms.
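The effect of two random choices can be illustrated with a small balls-into-bins experiment. The sketch below is not taken from the paper: the function name `max_load`, the bin and ball counts, and the seed are illustrative assumptions. Each ball samples a few bins at random and joins the least loaded one, which loosely mirrors an input sampling a few outputs and picking the one with the most buffer space.

```python
import random

def max_load(num_bins, num_balls, choices, seed=0):
    """Place balls one at a time. Each ball samples `choices` bins uniformly
    at random (with replacement) and goes to the least loaded sample.
    Returns the load of the fullest bin."""
    rng = random.Random(seed)
    load = [0] * num_bins
    for _ in range(num_balls):
        sampled = [rng.randrange(num_bins) for _ in range(choices)]
        target = min(sampled, key=lambda b: load[b])
        load[target] += 1
    return max(load)

# One random choice versus two random choices (hypothetical sizes).
one = max_load(10_000, 10_000, choices=1)
two = max_load(10_000, 10_000, choices=2)
print("max load with 1 choice:", one)
print("max load with 2 choices:", two)
```

With one choice the placement is purely random, the analogue of oblivious routing; with two choices the fullest bin is typically noticeably less loaded.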

3 Adaptive vs. Oblivious Routing

In this section, we compare adaptive and oblivious routing on a high-radix folded-Clos network and discuss the benefits of adaptive routing. We show how adaptive routing is particularly beneficial with limited buffering and nonuniformities. We assume an ideal adaptive routing in this section using the sequential allocation algorithm. The different allocation algorithms will be discussed in Section 4.

3.1 Benefits of Adaptive Routing in High-radix Network

The goal of adaptive routing in a folded-Clos network is to load balance across the different physical links during the uprouting to the common ancestor. Efficient adaptive routing will minimize packet collisions that occur when multiple packets request the same output. These collisions create congestion in the network and result in higher latency and lower throughput.

We calculate the ratio of latency using adaptive routing ($T_{adapt}$) and oblivious routing ($T_{ob}$). A ratio of 1 represents no benefit of adaptive compared to oblivious routing while a lower ratio represents a higher benefit of adaptive routing. The odd number of stages in a folded-Clos network can be represented as $2x + 1$, with $x$ stages in the uprouting of the folded-Clos that are routed either adaptively or obliviously and $x + 1$ stages in the downrouting that must be routed deterministically. The latency through the network can then be expressed as

$$T_{ob} = (2x + 1)\,Q_{ob}$$

$$T_{adapt} = x\,Q_{adapt} + (x + 1)\,Q_{ob}$$

where $Q_{ob}$ and $Q_{adapt}$ are the delay (including the queuing delay and the router pipeline delay) of a router that is routed obliviously and adaptively, respectively. At high offered loads where the queuing delay dominates the total delay, $Q_{adapt} \ll Q_{ob}$ since adaptive routing attempts to remove congestion and minimize any queuing delay; thus the ratio $T_{adapt}/T_{ob}$ can be simplified to $\frac{x + 1}{2x + 1}$.

Figure 2: Ratio of latency using adaptive and oblivious routing in a 4K node folded-Clos network as the number of stages in the network is varied. For a larger number of stages, the radix of the router decreases, and a smaller ratio represents a higher benefit of adaptive routing.

We plot this ratio in Figure 2 and the ratio from simulations³ of a 4K node folded-Clos network at an offered load of 0.95. We vary the radix of the routers such that with radix-128, only 3 stages (x = 1) are required and with radix-4, 23 stages are needed.

The simulation and theoretical ratios follow the same trend but the simulation ratio is higher because of the various assumptions made in the analysis. There is more benefit of adaptive routing as the number of stages increases, resulting in up to 40% savings in latency. High-radix networks have fewer stages but there is still approximately 30% reduction in latency with adaptive routing. This result and analysis assumed infinite buffers at each router, but we show in the next section that when buffering is limited or in the presence of nonuniformities, the benefit of adaptive routing is much greater in a high-radix network.
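The stage counts and the high-load latency ratio from the analysis can be checked with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the helper names `uprouting_stages` and `latency_ratio` are hypothetical, each router is assumed to split its ports evenly into radix/2 down ports and radix/2 up ports, and the ratio used is the limit $(x + 1)/(2x + 1)$ reached when the adaptive queuing delay is negligible.

```python
def uprouting_stages(num_nodes, radix):
    """Return x for a folded-Clos with 2x + 1 stages, assuming every router
    uses radix/2 ports toward the terminals and radix/2 ports upward."""
    down = radix // 2
    levels, reach = 1, down
    # Add levels until one subtree can reach every terminal.
    while reach < num_nodes:
        reach *= down
        levels += 1
    return levels - 1

def latency_ratio(x):
    """High-load limit of T_adapt / T_ob = (x + 1) / (2x + 1)."""
    return (x + 1) / (2 * x + 1)

for radix in (128, 16, 4):
    x = uprouting_stages(4096, radix)
    print(radix, 2 * x + 1, round(latency_ratio(x), 3))
```

For the 4K node network, radix-128 gives x = 1 (3 stages) and a ratio of 2/3, while radix-4 gives x = 11 (23 stages) and a ratio of 12/23, so the analytical benefit grows with the number of stages and approaches 1/2.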

3.2 Performance Evaluation 

In this section, we provide additional simulation results to compare adaptive and oblivious routing. We use a [...]

³ The simulation setup is described in Section 3.2.




Figure 3: Block diagram of a 1K node high-radix 3-stage folded-Clos network with radix-64 routers. P0-P1023 represents the terminals, R0-R31 represents the first level routers, and R32-R63 represents the second level routers.

Figure 5: Latency distribution of packets with an offered load of 0.9 with (a) oblivious routing and (b) adaptive routing.



[...] pattern (Figure 4(c)). With a congestion-free traffic pattern such as the bit complement permutation, adaptive routing results in a constant delay regardless of the offered load, while congestion increases latency at higher offered load with oblivious routing (Figure 4(d)).

3.2.2 Finite Buffers 

When buffering is limited, the throughput of oblivious routing suffers compared to adaptive routing, as shown in Figure 4(b) when the input buffers are limited to 16 entries.⁵ Adaptive routing provides approximately 10% higher throughput as oblivious routing does not consider the state of the network, i.e. the randomly selected output may not be available because of lack of buffer space.

The poor instantaneous load-balancing of oblivious routing can be shown by a snapshot of the buffer utilization of the middle stage routers during simulation. For each middle stage router, we plot the size of the input buffer with the highest occupancy in Figure 6 at an offered load of 0.8. The average queue utilization is under 1 for both routing algorithms, but the distribution of the maximum queue occupancy is different. With oblivious routing, some of the buffers are filled to capacity (16 entries) and the average of the maximum buffer depth is approximately 5 entries. In contrast, the average of the maximum buffer depth is only 2 entries with adaptive routing and the maximum value is only 7, or roughly 50% of the capacity. Because of this imbalance in buffer utilization, oblivious routing also has lower throughput with limited buffering.

3.2.3 Nonuniformity - Presence of Deterministic Traffic

Additional benefits of adaptive routing can be observed in the presence of nonuniformity. Because of the symmetry in the topology, nonuniformity cannot be created by the traffic pattern itself since the traffic can be distributed across all the middle stages.

⁵ 16 buffer entries are sufficient to cover the credit latency.

Figure 6: A snapshot of the maximum buffer size of the middle stage routers in the 1K node network. The distribution is shown for (a) oblivious and (b) adaptive routing at an offered load of 0.8, and the buffer depth of the routers is 16 entries.

The nonuniformity can result from deterministic routing being used in the network. In a shared-memory multiprocessor, it is often necessary to ensure ordering of requests to a given cache-line memory address because of the memory consistency model. Therefore, deterministic routing, such as that used by the Cray BlackWidow network [20], can be used to provide in-order delivery of all request packets at a cache-line address granularity for a source-destination pair. Since response packets in the network do not require ordering, they can be routed by using either oblivious or adaptive routing. By taking into account the congestive state of the network, adaptive routing avoids any nonuniformities that may be introduced as a result of deterministic routing of the request traffic. Oblivious routing does not take into account these potential nonuniformities and may randomly select an output port which has a substantial amount of deterministically routed traffic. In effect, the adaptive routing will "smooth out" any nonuniformities in the traffic pattern to load balance the set of available links.

To illustrate this nonuniformity, we simulate a traffic pattern where each node i sends traffic to (i + k) mod N, where k is the radix of the routers and N is the total number of nodes. Using this traffic pattern, 50% of the traffic is routed using deterministic routing and the remaining 50% of the traffic is routed either adaptively or obliviously. The simulation result is shown in Figure 7(a). With oblivious routing, the throughput is limited to under 70% but with adaptive routing, 100% throughput can be achieved. The deterministically routed traffic causes nonuniformities in the various different middle stages as they need to route through the same middle stage. Adaptive routing allows the other traffic to be routed around these nonuniformities while oblivious routing cannot avoid the nonuniformity, leading to lower throughput.

Figure 7: [...] for when (a) half of the traffic is routed using deterministic routing and (b) the network has faults.



Figure 8: Block diagram of a radix-8 3-stage folded-Clos network with a fault. By using oblivious routing, the uplinks to the middle stages are not load balanced.

3.2.4 Nonuniformity - Faults in the Network

Nonuniformity can also be created by oblivious routing in the presence of faults in the network. An example is shown in Figure 8. Assume that the downlink from R4→R0 is faulty and we observe the wc-UR traffic generated from nodes connected to R1. Any traffic generated for R0 cannot be routed through R4 since it cannot reach its destination, resulting in the traffic being load balanced across the other three routers (R5-R7). However, traffic destined for R2 and R3 will be equally distributed across all four uplinks. As a result, the uplink from R1→R4 will be underutilized, while the other three uplinks will be overutilized, limiting the throughput of the network. To load balance appropriately, the traffic for R2 and R3 should utilize the R1→R4 uplink more so that the traffic is balanced.

To evaluate the impact of faults on adaptive and oblivious routing, we simulate a faulty network with the 1K network shown in Figure 3 with approximately 1% of the links between the first and second level routers (10 of the 1,024 links) assumed to be faulty. The simulation results on this network with the wc-UR traffic pattern are shown in Figure 7(b). By load balancing appropriately across the healthy links, adaptive routing leads to approximately 2x improvement in throughput.

4 Allocation Algorithms in Adaptive Routing

Adaptive routing in a folded-Clos requires an allocation algorithm since the outputs (middle stages) need to be appropriately assigned to the inputs. We discuss the different allocation algorithms and compare their performance in this section.



4.1 Algorithm Description 

To load balance in a folded-Clos network, adaptive routing selects the middle stage with the least amount of congestion, i.e. the middle stage with the least amount of buffers in use. Since our simulation is done with input-queued routers, credits from the downstream routers are used to load-balance. Thus, middle stages with larger credits (more buffer space) are preferred over those with lower credits.

We evaluate the following four allocation algorithms. Unless stated otherwise, ties (equal credit counts) are broken randomly.

• sequential: Each input i makes its adaptive decision after inputs 0 through i-1 have made their decisions and updated the state of the network. The allocation algorithm assigns outputs to each input one at a time, taking into account the previous allocations of this cycle. To provide fairness, the starting input is selected randomly each cycle.

• greedy: Each input makes its adaptive decision independently. Each input selects an output but does not take into account the allocations of the other inputs in the same cycle.

• sequential_r(n): A randomized version of sequential, where each input selects n random outputs and adaptively selects among them. sequential_r(1) is similar to oblivious routing.

• greedy_r(n): A randomized version of greedy, where each input selects n random outputs and adaptively selects among them. For the largest value of n, greedy_r(n) is identical to greedy, and greedy_r(1) is identical to oblivious routing.

Figure 9: Adaptive routing comparisons with (a) infinite buffers and (b) 16 buffers using the wc-UR traffic pattern.

Figure 10: The impact on the saturation throughput as (a) radix and (b) packet size is varied using the greedy routing algorithm. Higher radix and smaller packet size limit the throughput of the greedy algorithm.
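The four allocation algorithms can be sketched in a few lines for a single allocation cycle. This is an illustrative sketch, not the simulator used for the results: the function names, the use of a plain list of credit counts, consuming one credit per allocation, and sampling random outputs with replacement are all assumptions made for the example.

```python
import random

def pick_best(credits, candidates, rng):
    """Return the candidate output with the most credits; ties broken randomly."""
    best = max(credits[o] for o in candidates)
    return rng.choice([o for o in candidates if credits[o] == best])

def greedy(credits, num_inputs, rng, n=None):
    """Every input decides independently from the same credit snapshot.
    With n set, each input only compares n randomly sampled outputs,
    which corresponds to greedy_r(n)."""
    outputs = list(range(len(credits)))
    alloc = []
    for _ in range(num_inputs):
        cand = outputs if n is None else [rng.choice(outputs) for _ in range(n)]
        alloc.append(pick_best(credits, cand, rng))
    return alloc

def sequential(credits, num_inputs, rng, n=None):
    """Inputs decide one at a time, starting from a random input. Each
    allocation consumes one credit (an assumption), so later inputs see
    the updated state. With n set, this corresponds to sequential_r(n)."""
    credits = list(credits)  # work on a copy of the caller's state
    outputs = list(range(len(credits)))
    alloc = [None] * num_inputs
    start = rng.randrange(num_inputs)
    for step in range(num_inputs):
        i = (start + step) % num_inputs
        cand = outputs if n is None else [rng.choice(outputs) for _ in range(n)]
        alloc[i] = pick_best(credits, cand, rng)
        credits[alloc[i]] -= 1
    return alloc

rng = random.Random(1)
print(greedy([1, 5, 2, 3], 4, rng))      # every input picks output 1
print(sequential([4, 4, 4, 4], 4, rng))  # inputs spread over all outputs
```

In this sketch, greedy sends every input to the single best output, while sequential spreads the inputs because each allocation lowers the credit count seen by the next input.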

In the sequential and the sequential_r(n) allocation algorithms, if multiple outputs have the same credit count, the ties are broken randomly. However, if one of the outputs with the same credit count has already been selected by a previous input in the same cycle, we provide priority to the other outputs. For example, if four outputs {O0, O1, O2, O3} have a credit count [...]

4.2 Algorithm Comparison

We simulate the different allocation algorithms using the simulation setup described in Section 3. For the randomized allocation algorithms, we use n = 2. The allocation algorithms are compared in Figure 9 using the wc-UR traffic pattern; the results for other traffic patterns follow the same trend.

With infinite buffers, sequential provides the best performance [...]. The greedy_r(2) also provides the same throughput, but leads to higher latency, approximately 60% higher at an offered load of 0.95.

The difference in latency between sequential_r(2) and greedy_r(2) is reduced with limited buffering (Figure 9(b)). With unlimited buffers, greedy_r(2) might randomly select a bad output, i.e. an output which has a lot of packets. However, with limited buffering, if bad outputs are selected which correspond to outputs that are full, the packet will be stalled and allocation is re-attempted in the next cycle and can avoid the bad outputs. Thus, the latency difference is less than 20% at an offered load of 0.85 near saturation.

Regardless of the amount of buffering, the greedy algorithm performs [...] inputs would select the same output. This allocation would be an optimal local decision for the four inputs but does not globally load balance across all the outputs and leads to a poor allocation. As a result, congestion is created at an output and the poor allocation decision creates a head-of-line blocking effect [12] and limits the throughput. [...] the greedy algorithm is not problematic with low-radix routers but becomes problematic with high-radix routers. The throughput of the greedy algorithm is compared in Figure 10(a) on a 4K [...]. Low-radix networks achieve almost 100% throughput but the throughput drops as the radix increases, and at the highest radices evaluated the throughput is under 60%.
The packet size also impacts the performance of the greedy algorithm as a larger packet size increases the throughput (Figure 10(b)). Since the routing decision is made only for the head flit of a packet, a large packet size decreases the probability of having 2 or more new packets arriving in the same cycle, reducing the chance of output collision from the poor routing decision. In Figure 11, we plot this probability as a function of offered load. The probability approaches 1 very quickly for a radix-64 router, which explains the poor performance of high-radix routers for the greedy algorithm using 1-flit packets. However, the probability gradually approaches 1 for a radix-8 router. With a packet size of 8 flits in a radix-64 router, the probability is reduced and behaves very similar to a radix-8 router with 1-flit packets. Thus, it is essentially the ratio between the radix and the packet length that determines the performance of the greedy algorithm.

Figure 11: Probability of two or more packets arriving in the same cycle as the radix and the packet size is varied.
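The dependence on the ratio between radix and packet length can be illustrated with a simple probability model. The sketch below is an assumption for illustration and is not necessarily the model behind Figure 11: every input port is assumed to see a new head flit independently in each cycle with probability equal to the offered load divided by the packet length in flits, and the function name `prob_two_or_more` is hypothetical.

```python
def prob_two_or_more(num_inputs, offered_load, packet_flits):
    """P(at least two packet heads arrive in the same cycle), assuming each
    of num_inputs ports independently sees a new head flit with probability
    offered_load / packet_flits per cycle."""
    p = offered_load / packet_flits
    none = (1 - p) ** num_inputs
    one = num_inputs * p * (1 - p) ** (num_inputs - 1)
    return 1 - none - one

for load in (0.1, 0.3, 0.5):
    print(load,
          round(prob_two_or_more(64, load, 1), 3),  # many inputs, 1-flit packets
          round(prob_two_or_more(8, load, 1), 3),   # few inputs, 1-flit packets
          round(prob_two_or_more(64, load, 8), 3))  # many inputs, 8-flit packets
```

Under this model, 64 inputs with 8-flit packets and 8 inputs with 1-flit packets see the same expected number of new packet heads per cycle (8 times the offered load), which is consistent with the two cases behaving similarly.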

To evaluate the impact of the parameter n, we vary n in the randomized allocation algorithms sequential_r(n) and greedy_r(n) and plot the latency at an offered load of 0.9 in Figure 12. When n = 1, the randomized allocation algorithms are identical to oblivious routing since only 1 randomly selected output is used. As n increases from 1 to 2, there is approximately 10% reduction in latency for greedy_r(n). However, as n is increased further, the latency increases significantly and is beyond the scale of the plot. As n approaches 32, the allocation algorithm behaves like greedy and the network is no longer stable as the offered load of 0.9 exceeds the throughput. Similar to greedy_r(1), sequential_r(1) behaves identically to oblivious routing. By increasing n from 1 to 2, there is an even greater reduction in latency with sequential_r(n). However, increasing n from 2 to 32 results in less than 10% reduction in latency. Thus, most of the performance gain can be realized with n = 2.


Figure 12: Randomized adaptive allocation algorithm comparison as n is varied for sequential_r(n) and greedy_r(n). The lower bound of the algorithm is shown by the sequential line.

Figure 13: Timeline of delay in adaptive routing in a high-radix folded-Clos network.

[...] that the random choices are not necessarily unique, e.g. for sequential_r(2), the two randomly selected outputs can be the same output. Simulations show that the uniqueness of the random choices has minimal impact on the latency of the allocation algorithm. As a result, the sequential_r(32) latency is slightly higher than that of sequential, but by less than 1%.



5 Cost Analysis 



Although adaptive routing provides performance benefits, the cost of implementation complexity, in terms of router latency and area, needs to be considered. For deterministic routing or source routing where only bit manipulation is required, or oblivious routing where only a random number needs to be generated, the routing pipeline delay will be minimal. However, earlier work showed that introducing adaptive routing can increase the complexity and the cycle time of a router. The larger number of ports in a high-radix router can further increase the routing complexity. For example, the YARC router requires 4 clock cycles for the routing decision [20].

A timeline of adaptive routing in a high-radix network is shown in Figure 13. The three main components to the latency of adaptive routing in high-radix routers are the following:

1. Collecting the network state (e.g. availability of the [...]



In this section, we provide a qualitative comparison of the different adaptive allocation algorithms discussed in Section 4.1. Then, we evaluate two different techniques that reduce the complexity of adaptive routing in a high-radix folded-Clos network. By using imprecise queue information, the complexity of the route allocation can be reduced. The precomputation of the allocations can also hide the router latency of adaptive routing. We evaluate their impact on performance and show that there is minimal performance loss. Imprecision can be used for all four of the algorithms but precomputation can only be used for the greedy and greedy_r(n) algorithms since they are distributed algorithms.

Table 1: Different values of precision used to evaluate the impact of precision in adaptive routing. Buffer depth of 16 entries is assumed.



5.1 Algorithm Cost Comparison 

Among the adaptive algorithms discussed in Section 4.1, sequential requires a centralized routing structure that collects all of the requests, performs the allocation, and distributes the results. The complexity of such a routing structure grows as O(k^2) and becomes prohibitively expensive to implement. The sequential_r(n) routing algorithm, even with n = 2, still requires a central routing structure since each input is routed sequentially.

The greedy algorithm does not require a centralized structure as the routing logic can be duplicated at each input and be distributed. However, the routing algorithm still requires the distribution of the output information to all of the inputs. In addition, the comparison logic at each input needs to compare all the outputs, requiring a significant amount of logic. The greedy_r(n) algorithm can simplify the implementation as only n comparisons need to be made. Simulation earlier showed that n = 2 performs well; thus, only a single comparison is needed.

5.2 Precision in Adaptive Routing 

The buffer depth (credits) is used to load-balance properly in adaptive routing and results so far assumed that the full information was available, i.e. the exact buffer depth was available from the credits. For a buffer with n entries, log2(n) bits are needed to obtain the exact buffer information. However, log2(n)-bit comparators can be costly with the larger number of ports in a high-radix router. As a result, the YARC router only uses 1 or 2 bits of information to make its adaptive decision [20].

Table 1 describes the different values of precision that are used to evaluate the impact of the precision. We assume a buffer depth of 16 and use the sequential algorithm in our evaluation. Using 0 bits of information corresponds to an allocation which does not consider the queue depth but considers only whether an output is available or not. An output is not available if another input with a multi-flit packet is transmitting to the output or if the downstream buffer is full. By using only the most significant bit (MSB) of the adaptive information, [...]. With 2 bits, the buffer information is defined in granularity of 4 entries, e.g. 00 corresponds to less than 4 entries, 01 corresponds to 4 or more entries but less than 8 entries, and so forth. Using 3 bits results in a finer granularity and 4 bits allows the exact queue information to be used.

Figure 14: Latency comparison near saturation for network simulations with wc-UR traffic using (a) 1 flit packets and (b) 10 flit packets as the precision of allocation is varied. The network shown in Figure 3 was used for the simulations.

Figure 15: Impact of precision [...] with nonuniform traffic.

Figure 16: Performance comparison with the use of precomputation.
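The precision levels discussed in Section 5.2 amount to keeping only the most significant bits of the queue depth. The following is a minimal sketch of that truncation, assuming a depth stored in 4 bits for the 16-entry buffers; the function name `quantize_depth` and the clamping of a completely full buffer to the largest 4-bit value are assumptions for the example, not details taken from the paper.

```python
def quantize_depth(depth, bits, depth_bits=4):
    """Keep only the `bits` most significant bits of a queue depth that is
    stored in `depth_bits` bits (4 bits for a 16-entry buffer)."""
    if not 0 <= bits <= depth_bits:
        raise ValueError("bits must be between 0 and depth_bits")
    # Clamp to the representable range (assumption: 16 is treated as 15).
    depth = min(max(depth, 0), (1 << depth_bits) - 1)
    return depth >> (depth_bits - bits)

# With 2 bits the information has a granularity of 4 entries.
for depth in (3, 4, 7, 8, 15):
    print(depth, format(quantize_depth(depth, 2), "02b"))
```

With 0 bits every depth maps to the same value, so only availability can be used for the decision, and with 4 bits the exact depth is preserved.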



[...] than using 0 bits (Figure 14(b)). However, by using only 2 or 3 bits of precision, the latency can still be reduced substantially compared to using 0 bits.

For the uniform traffic patterns, the throughput is very similar regardless of the precision used, so the difference in latency near saturation was compared. However, with nonuniformity such as the one discussed in Section 3.2.4, the different precision results in different throughput as shown in Figure 15. Using 0 bits of information still outperforms oblivious routing (see Figure 7(b)) but results in a reduction in throughput compared to using all 4 bits. By using only 2 bits of precision, the difference can be reduced by half, and 3 bits of precision performs nearly identically to using all 4 bits of precision.

5.3 Precomputation

To reduce the impact of adaptive routing on the router latency, the allocation can be precomputed. By using the queue information in the previous cycle, the routing decision can be precomputed and be available when a new packet arrives. The precomputation of allocations can be used with minimal loss in performance for the following reasons:

• The queue depth will change minimally from cycle to cycle.

• When full precision is not used, the change in the credit will have minimal impact, e.g. if only the upper 2 bits are used for adaptive decisions, the change in the lower 2 bits will have minimal impact.

• With the use of randomization, even if some of the data is stale, it might not impact the results.
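The idea of deciding from stale, reduced-precision credits can be sketched as follows. This is an illustrative sketch only: the class name `PrecomputedSelector`, its methods, and the deterministic tie-break toward the first candidate are assumptions (the algorithms in Section 4.1 break ties randomly), and it models just the choice between two candidate outputs of a greedy_r(2)-style decision.

```python
from collections import deque

class PrecomputedSelector:
    """Chooses between two candidate outputs using credit counts that are
    `delay` cycles old (delay=0 means no precomputation)."""

    def __init__(self, num_outputs, delay, bits=2, depth_bits=4):
        self.shift = depth_bits - bits  # low-order bits that are ignored
        # Keep delay + 1 snapshots; history[0] is the oldest one retained.
        self.history = deque(
            [[0] * num_outputs for _ in range(delay + 1)],
            maxlen=delay + 1,
        )

    def update(self, credits):
        """Record the credit counts of the current cycle (call once per cycle)."""
        self.history.append(list(credits))

    def choose(self, a, b):
        """Return the candidate with more (quantized, stale) credits.
        Ties go to the first candidate here for simplicity."""
        stale = self.history[0]
        qa, qb = stale[a] >> self.shift, stale[b] >> self.shift
        return a if qa >= qb else b

sel = PrecomputedSelector(num_outputs=4, delay=1)
sel.update([15, 3, 8, 0])   # cycle t-1
sel.update([14, 4, 8, 0])   # cycle t: small changes in the low bits
print(sel.choose(0, 1))     # decision based on cycle t-1 credits
```

Because the credits change by at most a few entries per cycle and only the upper bits are compared, the decision made from the older snapshot usually matches the one that fresh information would give.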



The performance comparison with precomputation is shown in Figure 16 for the greedy_r(2) algorithm. We compare the performance without precomputation to precompute1, where the output is calculated in the previous cycle, and precompute2, where the output is calculated 2 cycles in advance. The comparison is shown for the nonuniform traffic from Section 3.2.4 with 10 flit packets using only 2 bits of precision. Both precompute1 and precompute2 perform comparably to no precomputation, with precompute2 resulting in approximately 10% higher latency near saturation. With minimal loss in performance, precompute2 will allow an extra cycle to distribute the routing information, further minimizing the impact of wires and reducing the router pipeline delay as the routing results can be computed in advance.

6 Conclusion and Future Work 



[...] recent developments in high-radix routers to reduce the latency and the cost of the network. In this paper, we evaluate adaptive routing on a high-radix folded-Clos network and compare it to oblivious routing. We show that appropriate adaptive routing lowers latency and provides less variance in the packet latency compared to oblivious routing. With limited buffering, adaptive routing achieves better buffer utilization and results in higher throughput. Nonuniformity can be created in the folded-Clos topology with the presence of deterministically routed traffic and faults in the network. In the presence of such nonuniformity, adaptive routing provides significant advantages as it will "smooth out" the traffic and provide higher throughput.

We compare different allocation algorithms that can be used for adaptive routing and compare their performance. The sequential algorithm provides the best performance but is expensive to implement. By using randomized allocation algorithms, we show that the implementation can be simplified with minimal performance loss. We show that the implementation complexity can be further reduced by using imprecise queue information and the latency of adaptive routing can be hidden by precomputing the allocations.

The migration toward high-radix networks opens other opportunities for future work. Adaptive routing has been shown to provide better buffer utilization in this paper. However, it still remains to be seen what is the most effective way to partition the buffers available, e.g. partitioning the buffers into virtual channels, a shared buffering scheme, etc. In addition, an appropriate flow control for high-radix networks needs to be evaluated. The diameter of the network is reduced with high-radix routers and it remains to be seen how this impacts the flow control.